Abstract

The process of protein folding from an unfolded state to a biologically active, folded conformation
is governed by many parameters, e.g. the sequence of amino acids, intermolecular interactions, the
solvent, temperature and chaperone molecules. Our study, based on random matrix modeling of the
interactions, shows however that the evolution of the statistical measures, e.g. Gibbs free energy,
heat capacity and entropy, is single parametric. This information can explain the selection of specific
folding pathways from an infinite number of possible ways, as well as other folding characteristics
observed in computer simulation studies.

PACS numbers: 87.15.Cc, 87.15.hm

I. INTRODUCTION



The expression of a gene in DNA leads to the formation of amino acid sequences that
are the basic building blocks of proteins. The message contained in the DNA then manifests
itself through a specific protein structure, which in turn determines the protein's functionality.
In fact, the protein, once formed, acts as a feedback and leads to the creation of new copies of
the parent DNA.

The structure of a protein is determined purely by its amino acid sequence, and its
function depends on the ability of the protein to fold rapidly to its native structure [1, 3].
Based on numerous simulation studies of protein sequences (for example, see [4-11]), the
folding process is believed to reveal two main characteristics: (1) a single thermodynamically
stable, minimum free energy state, and (2) a very short time-scale for folding, e.g. milliseconds to
seconds. In the past, there have been several analytical attempts to explain these observations
(see for example [7, 12-16]). However, a thorough understanding of the rapid and selective
approach of a sequence to fold to a pre-determined configuration, despite the availability of an
infinite number of possibilities, is still missing (this is referred to as the protein folding problem).
The three main components of the missing information are: (i) an understanding of the
inter-atomic forces which lead to the native state from an unfolded state, (ii) prediction of the native
structure from its amino acid sequence (which usually requires prior knowledge of the inter-atomic
forces), and (iii) the origin of the fast folding speed. We seek this information by a new
analytical method based on random matrix modeling of the interactions within the
protein as well as with its environment, and attempt to justify the findings of the simulation
studies.

The interactions among the various units of a biological system are often complicated and
cannot be determined exactly. The complexity of the interactions manifests itself through
sample-to-sample fluctuations of the properties. Such fluctuations (different from thermodynamic
or statistical ones) have been observed in a wide range of complex systems, and
useful information can be extracted only from a statistical analysis of the properties [17]. For
example, the microscopic energy states of complex systems like proteins are not well-defined
and can at best be described by a statistical distribution. Previous analytical studies attempted
to circumvent this difficulty by averaging over the ensemble of protein sequences
and, therefore, could not provide information about the role of a specific sequence in the
folding. Our approach, however, is based on averaging over the ensemble of interactions
of a given sequence and does not suffer from this drawback. We analyze the interaction
matrix, i.e. the matrix whose entries are the pairwise interactions between the residues, as well
as their side chains, of a given sequence. The inaccuracy in determining the
interactions results in their distribution (a spread about some average value), with the nature and
degree of randomness governed by the local environment [18]. The interaction matrix then
turns out to be a random matrix, i.e. a matrix with some or all entries random. The physical
properties of such a matrix can be analyzed through its ensemble.

The concept of randomization of local interactions is essentially the same in spirit as the idea
of randomization of microscopic energy states used in the well-known random energy model
of disordered systems [19]. The details and the information contained in the random matrix
model, however, are significantly different from those of the random energy model. The latter
directly assumes a microscopic energy state to be Gaussian distributed, with all the
system-specific information contained in its mean and variance. But the explicit dependence of the
mean and variance on the system parameters, e.g. pairwise interaction strengths, is not
known, which reduces the applicability of the model in probing the folding problem. Further,
the assumption of randomness in this case requires the presence of disorder. In contrast, the
random matrix model, based on the inaccuracy-led randomization of local interactions,
depends on many parameters, each being a measure of local interaction accuracy which in
turn is sensitive to the system conditions. This leads to a multi-parametric distribution
of the microscopic energy states which allows one to explore the effect of local variations
on the sequence. Although our analysis finally leads to a single-parametric formulation of
the energy states, the parameter is a well-defined functional of the system conditions. This
makes the model more appropriate for the analysis of various folding stages (each described
by a set of system parameters).

A protein in aqueous solution is in equilibrium between its native (folded) and denatured
(unfolded) conformations. The thermodynamic stability of the native state is based on
the magnitude of the Gibbs free energy $G$ of the system relative to the unfolded state. A
negative $\Delta G = G_f - G_u$ (subscripts $f$, $u$ denoting the folded and unfolded states) indicates
that the native state is more stable than the denatured one. Many factors are responsible for the
folding and stability of native proteins, e.g. hydrophobic interactions, hydrogen bonding,
van der Waals forces and electrostatic interactions, conformational entropy, and the physical
environment (pH, buffer, ionic strength, excipients etc.) [20-22]. The factors stabilizing the
folded state are present in the unfolded state too and contribute to its stability. The folded state is,
however, marginally more stable than the unfolded state due to various compensating factors
enhancing its stability. Further, the functionality and folding speed (to the native conformation),
rather than the thermodynamic stability, seem to be the main criteria for the selection of a
natural protein conformation. Both of these characteristics require some degree of flexibility,
which in turn affects the free energy constraints on unfolding and refolding. These insights into
the folding process are mostly based on computer simulation studies, and it is desirable to seek
an analytical understanding which could then help, e.g., in designing proteins. This motivates
us to consider the partition function of a protein sequence, which can be used to determine
the stability measure, i.e. the Gibbs free energy of the sequence in a specific conformation, as well
as the heat capacity and entropy of unfolding.

During the past few decades, attempts to explain the folding and organization of proteins
from the unfolded or random coil state to the native folded state have put forward many
ideas. It is now believed that the polarity of proteins and their hydrophobic interaction
with the solvent dominate the folding process. The hydrophilic nature of polar amino acids
in aqueous solution attracts polar water molecules, while non-polar amino acids tend to be
hydrophobic and prefer binding with each other. These tendencies, along with other factors,
confine the space of available conformations, and the folding occurs only through specific
pathways. It appears to proceed from a restricted conformation ensemble by condensation
and secondary structure formation through an even smaller ensemble of "molten globules"
to a well-defined, three-dimensional single structure [1, 2]. The final stages of folding are also
believed to depend on the specific sequence of amino acids, whereas earlier stages should
be mostly insensitive to the sequence details. Further, molecules of the same protein can
follow different pathways to reach the native state; however, the thermodynamic stability criterion
(requiring a decrease of free energy) restricts the allowed pathways. To understand these
pathways, it is necessary to know the effect of varying residue-residue interactions as well as
protein-solvent interactions on the thermodynamic properties. For this purpose, we analyze
the energy distribution of a protein sequence under varying system conditions, which leads
to a system-dependent formulation of the thermodynamic measures.

The paper is organized as follows. Section II describes the energy formulation for a
microscopic state corresponding to a specific conformation. The random matrix model of
the interactions, based on the maximum entropy principle [23], is discussed in section III, which
is used in section IV to obtain the energy landscape, i.e. the distribution of a microscopic state
as a function of system and environmental conditions. This information is applied in section
V to derive the partition function and Gibbs free energy for folding. The heat capacity for
denaturation and the thermodynamic entropy are discussed in section VI. Section VII contains
concluding remarks.



II. MICROSCOPIC ENERGY STATES OF A PROTEIN SEQUENCE 

A physicist's approach to the folding problem is based on applying statistical energy functions
to explore a large set of alternative structures of a target protein, with the native state given
by the lowest energy structure. An accurate description of the Gibbs free energy function
needs to take into account the many-body interactions among residues (hydrogen bonds, ion
pairs, van der Waals interactions, hydrophobic interactions) as well as the effect of the solvent.
Fortunately, however, a simplified version of the energy function based on the pairwise contact
approximation has turned out to be quite a good description in many folding simulation
studies [25]. Within this approximation, the energy of a particular conformation of a
protein sequence of $N$ residues can be expressed in terms of an $N \times N$ contact map matrix
$C$ whose matrix elements represent the pairwise contact potential. Consider a sequence
$A = (A_1, A_2, A_3, \ldots, A_N)$, with $A_k$ as the amino acid at the $k$-th position in the chain, which
folds into a structure whose contact map is $C$. The energy of the conformation can be given as

$$E(C, A, U) = \sum_{k,l} C_{kl}\, U_{kl}(A_k, A_l) = \mathrm{Tr}\,[C\,U] \tag{1}$$

Here $U$ is an $N \times N$ symmetric matrix with its elements $U_{kl} = U(A_k, A_l)$ as the interaction
between residues $A_k$ and $A_l$ (present at positions $k$ and $l$ of the sequence), where $A_k$, $A_l$
belong to a set of the twenty types of amino acids.

The contact matrix $C$ contains information about the connectedness of the sequence.
Based on the connectivity between two residues, the elements of the contact matrix are
usually allowed to take binary values:

$$C_{kl} = \begin{cases} 1 & \text{if residues } k \text{ and } l \text{ are connected} \\ 0 & \text{otherwise} \end{cases} \tag{2}$$

The criterion for connectedness is usually taken to be the distance between the heavy atoms
in the two residues: two residues are assumed to be in contact if any two heavy atoms
belonging to them are closer than a threshold distance (1-10 Angstroms).
The effective energy can be rewritten as

$$E(C, A, U) = \mathrm{Tr}\,[H] \tag{3}$$

where the matrix $H$ is the product of the contact matrix $C$ and the interaction matrix $U$:

$$H_{kl} = \sum_{j=1}^{N} C_{kj}\, U_{jl} \tag{4}$$
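
As a small numerical illustration of eqs.(1)-(4), the following sketch evaluates the contact energy of a three-residue toy chain in two ways: as the trace of $H = CU$ and as the explicit double sum over contacts. The matrices `C` and `U` and the helper names are hypothetical values chosen only for the example; they are not taken from any protein data bank.

```python
# Toy evaluation of E = Tr[C U] (eqs. 1-4). All numbers are illustrative assumptions.

def matmul(a, b):
    """Plain-Python product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][j] * b[j][k] for j in range(n)) for k in range(n)]
            for i in range(n)]

def trace(a):
    """Sum of the diagonal elements."""
    return sum(a[i][i] for i in range(len(a)))

def contact_energy(c, u):
    """E = Tr[H] with H = C U, as in eqs. (3) and (4)."""
    return trace(matmul(c, u))

def pairwise_energy(c, u):
    """E = sum_{k,l} C_kl U_kl, the double sum of eq. (1)."""
    n = len(c)
    return sum(c[k][l] * u[k][l] for k in range(n) for l in range(n))

# Hypothetical contact map: residue 1 touches residues 0 and 2.
C = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]

# Hypothetical symmetric interaction matrix (arbitrary energy units).
U = [[0.0, -1.0, 0.5],
     [-1.0, 0.0, -2.0],
     [0.5, -2.0, 0.0]]

E_trace = contact_energy(C, U)
E_pairs = pairwise_energy(C, U)
print(E_trace, E_pairs)
```

Both routes agree because $U$ is symmetric; note that the double sum, as written in eq.(1), counts each contact pair twice.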

Eq.(3) can be applied to derive $P(E, C, u)$, the distribution of the energy state $E$ for a
specific $C$ matrix, or the energy landscape for each state of the protein, e.g. neutral, charged,
folded, intermediate or unfolded (the energy of a protein being a function of the topological
arrangement of the atoms) [26]. An energy landscape depicts energy as a function of the
conformation for a given state of the protein. The stable conformation corresponds to the global
minimum of the landscape, with its smooth, well-correlated structure indicating the stability
of the protein [20].

The energy function given in eq.(3) is one of the most studied forms in computer
simulation studies of protein folding. Although this function is good enough for threading set
simulations, it is believed to be not accurate enough to allow off-lattice folding simulations.
This motivated the consideration of new energy functions, e.g. THOM2, which captures
the environment of each residue by assigning a potential energy $U(A_l, S_{l_a})$ for each contact
$S_{l_a}$ to a residue $A_l$ [28]. The total energy of a protein in this case is a sum of the site
contributions:

$$E(A, U) = \sum_{l=1}^{N} \sum_{l_a=1}^{l_m} U(A_l, A_{l_a}) \tag{5}$$

where $l = 1 \to N$ with $N$ as the total number of residue sites in the sequence, $A_{l_a}$ as the
$a$-th contact to the residue at site $l$, and $l_m$ as the total number of contacts to the site $l$.

The interactions between the side chains of various residues are crucial to achieve the
three-dimensional structure of the unique folded conformation. Such interactions are not taken into
account in eq.(5). This motivates us to consider a generalization of eq.(5). Let $U_{l_a k_a}(A_{l_a}, A_{k_a})$
be the interaction strength between side chains $A_{l_a}$ and $A_{k_a}$; the total energy of pairwise
interactions is then

$$E(A, U) = \sum_{k,l} \sum_{a=1}^{m+1} U_{k_a l_a}(A_{k_a}, A_{l_a}) \tag{6}$$

Note that here the interaction between the residues is included in the sum by treating each
residue as a side chain too. Due to the side chain interactions, the size of the $U$-matrix is now
increased: $N_u = \sum_{l=1}^{N} (l_m + 1)$. The missing or weak connections among the side chains of
different residues, and mutually dependent pairwise interactions within a single side chain,
may lead to an effectively sparse form of the $U$ matrix with many correlated elements.

To proceed further, we need information about the interactions among residues in
the sequence as well as with the solvent. In protein simulation studies, this information is
usually taken from the protein data bank (PDB). However, as discussed in the next section, the PDB
information is only approximately accurate and can be improved by taking the error into
account, i.e. by considering the distribution of interaction strengths. The latter is then used
to determine the distribution $P(E)$ and the partition function.



III. DISTRIBUTION OF INTERACTION STRENGTHS: A RANDOM MATRIX 
MODEL 



Consider the interaction matrix $U$ of a protein sequence with $N$ residues, with its elements
$U_{kl}$ describing the pairwise interaction between residues for a given set of system conditions.
For notational simplification, henceforth, we denote $U_{kl}$ by $U_\mu$ with $\mu = \{kl\}$ as a single
index (unless details are required) which can take values from $1 \to M$. Here $M$ is the total
number of distinct matrix elements: $M = N(N + 1)/2$.

The presence of the environment adds to the degree of complexity of the interactions in
the chain. This renders an exact determination of $U_\mu$ technically difficult, and they can be
determined only within a certain degree of accuracy which, being sensitive to local system
conditions, varies from element to element. Each $U_\mu$ can then be best described by a
distribution with parameters sensitive to system conditions (see [18]).

Based on the extent of the available information about the system conditions, the distribution of
each $U_\mu$ can be obtained by invoking the maximum entropy hypothesis [23]: in the absence of any
further information, the simplest and least biased hypothesis is that the system is described
by the distribution $p(U)$ that maximizes Shannon's information entropy $S$, where

$$S[p(U)] = -\int p(U) \ln p(U)\, d\tau(U) \tag{7}$$

with $\tau(U)$ as the invariant measure in the $U$-space. For example, consider the system
subjected to the following constraints: (i) the probability density $p(U)$ is conserved (normalized
to unity), (ii) each $U_\mu$ is described by an independent, random distribution with its higher
order ($> 2$) moments negligible, (iii) the mean $\langle U_\mu \rangle = u_\mu$ and the 2nd moment
$\langle U_\mu^2 \rangle = v_\mu + u_\mu^2$ are given by the system conditions. The maximization of the Shannon
entropy under these constraints leads to a Gaussian distribution of $U_\mu$:

$$p(U, v, u) = \prod_{\mu=1}^{M} \frac{1}{\sqrt{2\pi v_\mu}} \exp\left[-\frac{(U_\mu - u_\mu)^2}{2 v_\mu}\right] \tag{8}$$

where $u_\mu$, the ensemble averaged value of the interaction, could be taken e.g. from a protein data
bank. Note that the randomness of an interaction assumed here is different from considering a
"random" sequence. The components of a sequence may be well-defined but their interactions
may not be.
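
To make the independent Gaussian model concrete, the sketch below draws each interaction $U_\mu$ from a Gaussian with mean $u_\mu$ and variance $v_\mu$ and accumulates the energy $E = \sum_\mu U_\mu$. For independent elements the sampled energy should have mean $\sum_\mu u_\mu$ and variance $\sum_\mu v_\mu$. The three $(u_\mu, v_\mu)$ pairs, the seed and the sample size are arbitrary assumptions made for illustration only.

```python
import random
import statistics

def sample_energy(u, v, rng):
    """Draw one realization of E = sum_mu U_mu with U_mu ~ Normal(u_mu, v_mu)."""
    # random.gauss takes the standard deviation, hence the square root of the variance.
    return sum(rng.gauss(mean, var ** 0.5) for mean, var in zip(u, v))

# Hypothetical means and variances of three independent interaction elements.
u = [-1.0, -0.5, 0.3]
v = [0.04, 0.09, 0.01]

rng = random.Random(12345)          # fixed seed so the run is reproducible
energies = [sample_energy(u, v, rng) for _ in range(20000)]

sample_mean = statistics.fmean(energies)
sample_var = statistics.pvariance(energies)

print("mean:", sample_mean, "expected:", sum(u))
print("variance:", sample_var, "expected:", sum(v))
```

The sample statistics only approach the expected values within Monte Carlo error, so any comparison should allow a tolerance.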

The consideration of more realistic conditions, e.g. many-body interactions, would
introduce non-zero correlations among the $U_\mu$'s:

$$p(U, v) = C \prod_{\mu_1, \mu_2} \exp\left[-v_{\mu_1 \mu_2}\,(U_{\mu_1} - u_{\mu_1})\,(U_{\mu_2} - u_{\mu_2})\right] \tag{9}$$

with $v_{\mu_1 \mu_2}$ as the measures of correlations between $U_{\mu_2}$ and $U_{\mu_1}$. However, in the present study
we confine our analysis to the independent case.

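The Gaussian distributed $U$ (eq.(8)) leads to a Gaussian ensemble of $H$-matrices (from
eq.(4)):

$$P_H(H, C, u) = \prod_{\mu=1}^{M} \frac{1}{\sqrt{2\pi h_\mu}}\, e^{-\frac{(H_\mu - b_\mu)^2}{2 h_\mu}} \tag{10}$$

with

$$b_\mu = \langle H_\mu \rangle = \sum_j C_{kj}\, u_{jl}, \qquad
h_\mu = \langle H_\mu^2 \rangle - \langle H_\mu \rangle^2 = \sum_j C_{kj}^2\, (u_{jl}^2 + v_{jl}) - b_\mu^2 \tag{11}$$

Clearly, $P_H$ contains sequences with different interaction energies for a given contact map
as well as sequences with different contact maps for a given interaction matrix $U$.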

The energy function in eq.(3) being widely used in simulation studies, it is relevant to
consider the energy distribution of a sequence modeled by the ensemble $P_H$:

$$P(E, C, u) = \int \delta(E - \mathrm{Tr}\,[H])\, P_H(H)\, dH \tag{12}$$

$P(E, C, u)$ contains information about the energy landscape: the existence of a clear global
minimum of $P(E, C, u)$ in $C$-space for a fixed $u$ (i.e. a given protein sequence) indicates its
foldability, with the neighborhood containing information about the low-energy alternative
conformations. Note that the above formulation also allows the possibility of considering a more
generalized form of the contact matrix.

Eq.(6) being closer to realistic proteins, our main interest is to find $P(E)$ for this case:

$$P(E) = \int \delta\Big(E - \sum_\mu U_\mu\Big)\, p(U, v, u)\, dU \tag{13}$$

with $\mu = \{k_a, l_a\}$, $\sum_\mu U_\mu = \sum_{k_a, l_a} U_{k_a, l_a}$ and $p(U, v, u)$ as the density of the ensemble of
$U$-matrices, each of size $N_u$. Assuming the correlations among the matrix elements to be negligible, it can
again be described by eq.(8), with now $M = N_u(N_u + 1)/2$.

Eq.(13) can model various protein states, e.g. folded or unfolded. For example, the
interactions between side chains in an unfolded state are much weaker in comparison to a folded state.
The unfolded state can be described by eq.(8) by taking $u_\mu \to 0$, $v_\mu \to 0$ if $\mu = \{k_a, l_a\}$
is such that $k \neq l$ and if $A_{k_a}$ and $A_{l_a}$ correspond to the side chains. For the native state,
a well-defined three-dimensional structure, a large number of the $u_\mu$'s would be non-zero with
the corresponding $v_\mu$ very small. Similarly, the intermediate folding states would correspond to
varying $(u_\mu, v_\mu)$-strengths, based on the sequence and its environment. The transition from
the unfolded to the folded state can therefore be studied by a variation of these parameters.

IV. EVOLUTION OF P(E) DURING FOLDING PROCESS 

Let us first consider the $P(E)$ given by eq.(13).

As the folding proceeds, the interaction strengths of residues with each other as well
as with the local environment change, and the residues in the sequence rearrange themselves
(dictated by their chemical nature and affinities). The folding therefore corresponds to a
dynamics of the elements $U_\mu$ and an evolution of $p(U)$ in the $U$-matrix space.

The deterministic accuracy of each $U_\mu$ also fluctuates rapidly as the folding evolves,
with a different "time-scale" of fluctuations for each matrix element. This corresponds to a
change of the distribution parameters of the ensemble of the interaction strengths of a given
sequence. The folding process can then be considered as an evolution of the ensemble in
the parametric space. Since both describe the same process, the parametric space dynamics of
$p(U, u, v)$ is expected to mirror itself in its $U$-space dynamics. This is indeed the
case, as can be seen by a partial differentiation of eq.(8) with respect to $(u_\mu, v_\mu)$; a specific
combination of the first order parametric variations turns out to be equivalent to a diffusion
dynamics of $U_\mu$ along with a drift component:

$$-\gamma \sum_{\mu=1}^{M} \left[ 2 x_\mu \frac{\partial p}{\partial x_\mu} + u_\mu \frac{\partial p}{\partial u_\mu} \right]
= \sum_{\mu=1}^{M} \frac{\partial}{\partial U_\mu} \left[ \frac{g_\mu}{2} \frac{\partial}{\partial U_\mu} + \gamma\, U_\mu \right] p \tag{14}$$

where $x_\mu = g_\mu - 2\gamma v_\mu$ and $g_{kl} = 1 + \delta_{kl}$, with $\delta_{kl} = 1$ for $k = l$ and $0$ for $k \neq l$.



Multiplication of both sides of eq.(14) by the factor $\delta(E - \sum_\mu U_\mu)$ and subsequent integration
over the $U$-space gives, along with eq.(13),

$$\sum_{\mu=1}^{M_0} \frac{\partial P}{\partial z_\mu} = \frac{\partial}{\partial E} \left[ \frac{\partial}{\partial E} + \gamma E \right] P \tag{15}$$

with $z_\mu = -\frac{1}{2\gamma} \ln\left(|x_\mu|\, |u_\mu|^2\right)$, $M_0$ as the number of non-zero parameters $x_\mu$, $u_\mu$, and $\gamma$ as an
arbitrary constant with units of $E^{-1}$.

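As eq.(15) indicates, the combined effect of the first order parametric variations is a diffusion
of $P(E)$ in the energy space. Due to linearity, these first order changes are additive in
nature. The collective response of the sequence to these changes can then be mimicked by
the response to a single parameter $Y$:

$$\frac{\partial P}{\partial Y} = \frac{\partial}{\partial E} \left[ \frac{\partial}{\partial E} + \gamma E \right] P \tag{16}$$

where $Y$ is defined by the condition $\frac{\partial P}{\partial Y} = \sum_{\mu=1}^{M_0} \frac{\partial P}{\partial z_\mu}$ or, alternatively,

$$\sum_{\mu=1}^{M_0} \frac{\partial Y}{\partial z_\mu} = 1 \tag{17}$$

The above condition can easily be solved to give

$$Y = \frac{1}{\tilde{M}} \sum_{\mu=1}^{M_0} a_\mu z_\mu + c_0 \tag{18}$$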

with $\tilde{M} = \sum_\mu a_\mu$, and $c_0$ a constant determined by the initial condition. Here the $a_\mu$ are
arbitrary constants which can be fixed by physical considerations as follows. Eq.(18) describes
$Y$ as a weighted average of the $z_\mu$'s, each representing local accuracy fluctuations. Assuming
no particular bias of the folding towards any specific error, all $a_\mu$'s can be chosen equal. This gives

$$Y = -\frac{1}{2 M_0 \gamma} \ln\left[ {\prod_\mu}' |x_\mu|\, |u_\mu|^2 \right] + c_0 \tag{19}$$

Here $\prod'$ implies a product over non-zero $u_\mu$ and $x_\mu$, and $c_0$ is a constant determined by
the initial condition (i.e. the unfolded sequence). (A mathematically rigorous derivation of $Y$
can be found in [29].) Being a function of the system conditions governing folding,
e.g. interaction strengths as well as the local environment, $Y$ can be termed the folding
parameter. During folding, therefore, $P$ undergoes a $Y$-governed diffusion due to
accuracy-driven random forces, along with a finite drift caused by external forces, e.g. environmental
conditions.

Eq.(16) describes the flow of the probability $P(E, Y | E_0, Y_0)$ from an arbitrary initial
ensemble of the matrices $H_0$ to a steady state (occurring in the limit $\partial P / \partial Y \to 0$); the steady
state turns out to be a Gaussian free of any system conditions: $P(E, Y \to \infty) \propto e^{-\gamma E^2/2}$.
For an arbitrary initial state $P(E_0, Y_0)$, eq.(16) can be solved to give

$$P(E, Y | E_0, Y_0) = c\, \exp\left[-a\, (E - \alpha E_0)^2\right] \tag{20}$$

with $a = \frac{1}{2(1 - \alpha^2)}$, $c = \frac{1}{\sqrt{1 - \alpha^2}}$ and $\alpha = e^{-(Y - Y_0)}$. Let $P(E_0, Y_0)$ represent the energy
landscape of the denatured state. The probability $P(E, Y - Y_0)$ for various intermediate
stages between the denatured and native states can then be obtained by integrating eq.(20)
over $P(E_0, Y_0)$:

$$P(E; Y - Y_0) = \int P(E, Y | E_0, Y_0)\, P(E_0, Y_0)\, dE_0 \tag{21}$$

As eq.(19) indicates, $Y$ increases as folding proceeds; this is due to increasing contributions
from non-zero $u$, $v$ parameters.
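
The convolution in eq.(21) can be checked numerically for a simple case. The sketch below assumes, purely for illustration, a Gaussian initial landscape with mean `eps` and variance `eta`, sets $\gamma = 1$, and uses a kernel of the form of eq.(20) normalized to unit area (so its prefactor is $\sqrt{a/\pi}$, an assumption that differs from the constant $c$ by a factor independent of $E$). Under these assumptions the evolved density should again be Gaussian, with mean $\alpha\,\epsilon$ and variance $\alpha^2 \eta + 1 - \alpha^2$.

```python
import math

def gaussian(x, mean, var):
    """Normalized Gaussian density."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def kernel(E, E0, alpha):
    """Kernel of the form of eq. (20) with a = 1/(2(1 - alpha^2)), normalized to unit area."""
    return gaussian(E, alpha * E0, 1.0 - alpha ** 2)

def evolved_density(E, alpha, eps, eta, steps=4000, width=10.0):
    """Trapezoidal estimate of eq. (21) for a Gaussian initial state N(eps, eta)."""
    lo = eps - width * math.sqrt(eta)
    hi = eps + width * math.sqrt(eta)
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        E0 = lo + i * h
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * kernel(E, E0, alpha) * gaussian(E0, eps, eta)
    return total * h

# Hypothetical values: folding parameter shift Y - Y0 = 0.5, initial mean 2, variance 3.
alpha = math.exp(-0.5)
eps, eta = 2.0, 3.0
E = 1.0

numeric = evolved_density(E, alpha, eps, eta)
closed_form = gaussian(E, alpha * eps, alpha ** 2 * eta + 1.0 - alpha ** 2)
print(numeric, closed_form)
```

As $Y - Y_0$ grows, $\alpha \to 0$ and the evolved density approaches a unit-variance Gaussian centred at zero, independent of the initial landscape.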

$P(E, Y - Y_0)$, given by eq.(21), describes the energy landscape for a specific folding stage
represented by the functional $Y(u, v)$, which contains information about the system
conditions prevailing during that stage. Thus, beginning from an unfolded sequence, the folding
process is governed by the collective influence (described by $Y$) of the local interactions
(among residues as well as with the environment) on the protein dynamics. Different alternatives
for pairwise interactions may result in different $Y$ functions and therefore many trajectories
originating from a given unfolded state. The thermodynamic conditions, however, restrict
the choice of the folding trajectories. As discussed in the next section, the $Y$ dependence of
$P(E, Y - Y_0)$ leads to a $Y$-governed evolution of the thermodynamic measures, e.g. the Gibbs free
energy $G$, during folding. The thermodynamic stability criterion restricts the native state
to occur along a trajectory with a well-defined global minimum of $G$ occurring, say, at
$Y = Y_f$. Due to its dependence on the value of $Y$ and not on its functional form, $G(Y)$
may take the same value on more than one trajectory. Thus folding occurs along trajectories
with an approximately similar $G(Y)$ behavior, leading to a common global minimum, say
at $Y = Y_f$. The existence of a local minimum for $Y < Y_f$ may inhibit the folding to the native
state. Similarly, a local minimum occurring for $Y > Y_f$ may lead to misfolding with changing
environmental conditions.

As eq.(21) shows, different energy landscapes of initial sequences may lead to different
native states. However, if mutations of some of the residues leave $P(E_0, Y_0)$ of an unfolded
sequence unchanged, the native state will also remain unaffected; this is in agreement
with the observed robustness of the native state to sequence mutations.

The initial ensemble, that is, the ensemble of the unfolded or fully denatured protein, is a linear
sequence of residues with no secondary or tertiary structure, often existing as a random coil
where all conformations have comparable energies. $P(E_0, Y_0)$ in this case can be described
by a Gaussian distribution with mean $\epsilon$ and variance $\eta$:

$$P(E_0, Y_0) = \frac{1}{\sqrt{2\pi\eta}}\, e^{-\frac{(E_0 - \epsilon)^2}{2\eta}} \tag{22}$$

As is clear from their functional forms, eq.(14) is also valid for $P_H(H)$ (eq.(10)) after
replacing $v_\mu \to h_\mu$, $u_\mu \to b_\mu$, $\sum_{k,l,a} \to \sum_{k,l}$ and $p(U) \to P_H(H)$. Consequently, eq.(16)
describes the evolution of $P(E)$ given by eq.(12) too, with corresponding changes in
$Y$.

V. PARTITION FUNCTION AND GIBBS FREE ENERGY G



Eq.(21) for $P(E; Y - Y_0)$ can now be used to obtain the partition function $Z$ for the
conformation ensemble described by the complexity parameter $Y$:

$$Z(Y - Y_0, T) = \int e^{-\beta E}\, P(E, Y - Y_0)\, dE \tag{23}$$
$$\phantom{Z(Y - Y_0, T)} = \sqrt{2\pi}\; e^{\beta^2/4a}\; Z_0(Y_0, \tau) \tag{24}$$

where $\tau = T/\alpha$, $\beta = 1/kT$ and

$$Z_0(Y_0, \tau) = \int e^{-\alpha\beta E_0}\, P(E_0, Y_0)\, dE_0 \tag{25}$$

The free energy of the conformation at temperature $T$ is then given by

$$G(Y - Y_0, T) = -kT \ln Z = \alpha\, G_0(\tau) - \frac{\beta}{4a} - kT \ln(\sqrt{2\pi}) \tag{26}$$

where $G_0(\tau)$ is the free energy of the unfolded state at the rescaled temperature $\tau$:

$$G_0(\tau) = -k\tau \ln Z_0 \tag{27}$$

As eq.(26) implies, the evolution of $G$ at a given temperature $T$ is dictated by $\alpha$ and,
therefore, by $Y$, a function of the system conditions through $\{u_\mu\}$ and $\{v_\mu\}$. The influence of
the system conditions on $G$ can then be studied through $Y$.

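The stability of a conformation increases as its $G$ decreases relative to that of the unfolded
protein. The thermodynamic stability criterion for the folded conformation requires its free
energy to be minimum. This can be achieved by seeking system conditions, i.e. $\alpha = \alpha_f$ at
fixed $T$, for which

$$\frac{\partial G}{\partial \alpha}\Big|_T = 0, \qquad \frac{\partial^2 G}{\partial \alpha^2}\Big|_T < 0 \tag{28}$$

or equivalently,

$$\beta \alpha_f + G_0(\beta\alpha_f) + \alpha_f\, \frac{\partial G_0(\beta\alpha_f)}{\partial \alpha_f}\Big|_T = 0 \tag{29}$$

and

$$\alpha_f^2\, \frac{\partial^2 G_0}{\partial \alpha_f^2}\Big|_T - \beta\alpha_f - 2 G_0(\beta\alpha_f) < 0 \tag{30}$$

Substitution of $G_0$ in eq.(29) leads to the complexity parameter for the thermodynamically
stable conformation at a fixed temperature $T$: $Y_f = Y_0 - \ln(\alpha_f)$.

For example, for an unfolded sequence given by eq.(22), eq.(25) and eq.(27) give

$$Z_0(\tau) = e^{-\frac{\epsilon}{k\tau} + \frac{\eta}{2k^2\tau^2}}, \qquad G_0(\tau) = \epsilon - \frac{\eta}{2k\tau} \tag{31}$$

This, on substitution in eqs.(29, 30), gives

$$\alpha_f = \frac{\epsilon}{\beta(\eta - 1)}, \qquad \frac{\partial^2 G}{\partial \alpha^2}\Big|_T = \beta(1 - \eta) \tag{32}$$

The native state can therefore occur only if $\epsilon > 0$, $\eta > 1$.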
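
The stationarity condition can be explored numerically for the Gaussian unfolded state. The following sketch assumes units with $k = 1$, takes $G_0(\tau) = \epsilon - \eta/(2k\tau)$ for the unfolded state, and uses the free energy in the form $G = \alpha G_0(T/\alpha) - (\beta/2)(1 - \alpha^2) - kT \ln\sqrt{2\pi}$, i.e. eq.(26) with $a = 1/(2(1-\alpha^2))$; the parameter values are hypothetical. It locates $\alpha_f = \epsilon / (\beta(\eta - 1))$ and confirms by finite differences that the slope of $G$ vanishes there and that the curvature has the sign $\beta(1-\eta)$.

```python
import math

K_B = 1.0  # Boltzmann constant in the (assumed) reduced units

def g0(tau, eps, eta):
    """Free energy of a Gaussian unfolded state at rescaled temperature tau."""
    return eps - eta / (2.0 * K_B * tau)

def gibbs(alpha, T, eps, eta):
    """G(alpha, T) assembled from G_0 evaluated at tau = T / alpha."""
    beta = 1.0 / (K_B * T)
    return (alpha * g0(T / alpha, eps, eta)
            - 0.5 * beta * (1.0 - alpha ** 2)
            - K_B * T * math.log(math.sqrt(2.0 * math.pi)))

def stationary_alpha(T, eps, eta):
    """Closed-form stationary point alpha_f = eps / (beta (eta - 1))."""
    beta = 1.0 / (K_B * T)
    return eps / (beta * (eta - 1.0))

# Hypothetical parameters with eps > 0 and eta > 1.
T, eps, eta = 1.0, 0.5, 3.0
alpha_f = stationary_alpha(T, eps, eta)

h = 1e-3
slope = (gibbs(alpha_f + h, T, eps, eta) - gibbs(alpha_f - h, T, eps, eta)) / (2.0 * h)
curvature = (gibbs(alpha_f + h, T, eps, eta) - 2.0 * gibbs(alpha_f, T, eps, eta)
             + gibbs(alpha_f - h, T, eps, eta)) / h ** 2
Y_shift = -math.log(alpha_f)   # Y_f - Y_0

print(alpha_f, slope, curvature, Y_shift)
```

With these values $\alpha_f = 0.25$ lies between 0 and 1, which corresponds to a finite positive shift $Y_f - Y_0$.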

As eq.(26) implies, the stability of a given conformation changes with temperature too.
The temperature $T_m$ for maximum stability of a given conformation, with all other system
conditions fixed, can be obtained from the condition $\frac{\partial G}{\partial T}\big|_\alpha = 0$, which gives

$$2\, \frac{\partial G_0(\tau)}{\partial \tau}\Big|_\alpha + k\beta^2 (1 - \alpha^2) - 2k \log(\sqrt{2\pi}) = 0 \tag{33}$$

or, alternatively, with $S_0(\tau) = -\frac{\partial G_0(\tau)}{\partial \tau}\big|_\alpha$ (the entropy at a fixed $\alpha$),

$$2 S_0(\tau) - k\beta^2 (1 - \alpha^2) + 2k \log(\sqrt{2\pi}) = 0 \tag{34}$$

By substituting $S_0$ in the above equation, one can determine $T_m$ for a specific $\alpha$, i.e. a
sequence in a specific solvent. As eq.(34) suggests, the stability of a structure decreases if the
temperature $T > T_m$ or $T < T_m$; this also agrees with the simulation studies. For example,
for $P(E_0, Y_0)$ given by eq.(22), $S_0(\tau)|_\alpha = -\frac{\eta}{2k\tau^2}$. Eq.(34) then gives
$T_m = \frac{1}{k} \left( \frac{1 + (\eta - 1)\alpha^2}{\log 2\pi} \right)^{1/2}$. It is
easy to check that $\frac{\partial^2 G}{\partial T^2} < 0$ at $T = T_m$, indicating a decreasing $G(T)$ and therefore stability
for $T > T_m$ or $T < T_m$.

Note that $\alpha$ (through $Y$) depends on both the interactions within the sequence and those with
the environment (through the set $\{u, v\}$). Following our approach, the folding therefore occurs
when the matrix $v$ (or $C$ for the case of eq.(12)), for a specific interaction matrix $u$, satisfies
eq.(29) with $G_0$ of the unfolded state at a temperature $T/\alpha$. The approach also explains the
existence of specific folding pathways at a fixed temperature: $Y_f$ in the parametric space
$\{u_\mu, v_\mu\}$ is connected to $Y_0$ through several paths; however, folding occurs along paths with
relatively maximum stability (among all paths) for each intermediate state too. These folding
paths correspond to the minimum free energy change between any two intermediate points.
Also note that $Y$ (through $\{u, v\}$) evolves with time at a rate which depends on
the environment; this information may help in the determination of the folding speed at a fixed
temperature.

The appearance of $G_0$ in eqs.(26)-(34) indicates that the information specifying the native
structure, as well as the pathway to attain that state, is contained in the amino acid sequence
of each protein. The presence of both $\alpha$ and $T$ in these equations, however, reveals the
dependence of the folding process on environmental factors as well as on the various interactions among
residues. Thus nearly identical amino acid sequences may not fold similarly if their
environment is different. This is in agreement with the results obtained by simulation studies
of proteins.



VI. HEAT CAPACITY AND ENTROPY FOR DENATURATION 



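The heat capacity $C_p$ (at constant pressure $p$), defined as

$$C_p = \frac{\partial E}{\partial T}\Big|_p = k\beta^2\, \frac{\partial^2 \ln Z}{\partial \beta^2}\Big|_p \tag{35}$$

is an important measure to study the dynamics of unfolding [21] and the hydrophobic effect
on the protein stability.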

As eq.(29) and eq.(34) indicate, the existence of a thermally stable native state depends
on the specific relation of $G_0$, $Y$ and $T$. It may not be satisfied by a sequence under certain
environmental conditions; the protein then will not fold into its biochemically functional
form. Further, a folded conformation may unfold or "denature" if changes in the system
conditions, e.g. temperature, concentrations of solutes, pH conditions, mechanical forces, and
chemical denaturants, result in a violation of eq.(29) or eq.(34). The effect of all these
changes on $C_p$ can be studied through its $Y$-formulation (obtained from eq.(35) and eq.(26)):

$$C_p(T) = k\beta^2 \left[\, \cdots \,\right] \tag{36}$$
$$\phantom{C_p(T)} = C_{p0}(\tau) + k\beta^2 (\alpha^2 - 1) \tag{37}$$

with $C_{p0}(\tau) = k\beta_0^2\, \frac{\partial^2 \ln Z_0}{\partial \beta_0^2}\Big|_p$ as the heat capacity of the unfolded protein at temperature $\tau = T/\alpha$
and $\beta_0 = 1/k\tau$.

The unfolding primarily occurs due to the exposure to solvent of side chains (e.g. non-polar
groups) that are buried in the native state. The folding is believed to be dominated by the
binding of polar groups, assisted by the solvent. Both processes involve a change in $C_p$; for
a sequence going from a state $Y_i$ to a state $Y_f$ at temperature $T$, the change in heat
capacity $\Delta C_p = C_p(Y_f) - C_p(Y_i)$ can be given as (from eq. (37))

$$\Delta C_p = C_{p0}(\tau_f) - C_{p0}(\tau_i) + k\beta^2 (\alpha_f^2 - \alpha_i^2) \qquad (38)$$

with $\alpha_k = e^{-(Y_k - Y_0)}$ for $k = f, i$. Due to the positive and negative $C_p$ of hydration for apolar
and polar groups, respectively, the sign of $\Delta C_p$ can provide information about the nature of
solvation (e.g. polar or apolar) and about folding, unfolding, misfolding etc. For example, for the
unfolded state given by eq. (22),

$$\Delta C_p = k\beta^2 (1 - \eta)(\alpha_f^2 - \alpha_i^2) \qquad (39)$$

Thus for unfolding, which corresponds to $Y_f < Y_i$ or $\alpha_f > \alpha_i$, one gets $\Delta C_p > 0$. The
folding, with $Y_f > Y_i$, similarly corresponds to $\Delta C_p < 0$.
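The sign rule above can be illustrated numerically. The following is a minimal sketch of eq. (39), not the authors' code: the function names and the numerical values of $Y_0$, $\eta$ and $\beta$ are hypothetical, units with $k = 1$ are used, and $\eta < 1$ is assumed so that the prefactor $(1 - \eta)$ is positive.

```python
import math


def alpha(Y, Y0):
    """alpha_k = exp(-(Y_k - Y0)), the definition given below eq. (38)."""
    return math.exp(-(Y - Y0))


def delta_cp(Y_i, Y_f, Y0, eta, beta, k=1.0):
    """Eq. (39): Delta C_p = k beta^2 (1 - eta) (alpha_f^2 - alpha_i^2)."""
    a_i, a_f = alpha(Y_i, Y0), alpha(Y_f, Y0)
    return k * beta**2 * (1.0 - eta) * (a_f**2 - a_i**2)


# Hypothetical parameter values, chosen only for illustration (eta < 1 assumed).
unfolding = delta_cp(Y_i=2.0, Y_f=1.0, Y0=0.5, eta=0.3, beta=1.0)  # Y_f < Y_i
folding = delta_cp(Y_i=1.0, Y_f=2.0, Y0=0.5, eta=0.3, beta=1.0)    # Y_f > Y_i

print("unfolding: Delta C_p > 0 ?", unfolding > 0)
print("folding:   Delta C_p < 0 ?", folding < 0)
```

Because $\alpha$ decreases monotonically with $Y$, swapping the initial and final states only flips the sign of $\Delta C_p$; its magnitude is unchanged.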
The entropy

$$S = k(\ln Z + \beta E) = k\beta^2 \frac{\partial G}{\partial \beta} \qquad (40)$$

is another important thermodynamic property commonly measured for proteins. A competition
of entropy with stabilizing forces determines the possibility of unfolding, which occurs
at temperatures where $S$ becomes dominant. The $Y$-dependence of $S$ can be given as

$$S(Y, T) = S_0(\tau) - \frac{1}{2} k\beta^2 (1 - \alpha^2) + k \ln\sqrt{2\pi} \qquad (41)$$

with $S_0(\tau)$ as the entropy of the unfolded sequence at temperature $\tau$.

The entropy change $\Delta S$ contains information about the reversibility ($\Delta S < 0$) or
irreversibility ($\Delta S > 0$) of the process. The $Y$-dependence of $\Delta S$ can then be used to
determine the system conditions that can lead to refolding of a misfolded protein. For example,
the presence of chaperon molecules may change the interaction parameters (denoted by $\omega$) and
therefore $Y$ and $\Delta S$. For a sequence changing from state $Y_i \to Y_f$, $\Delta S$ is

$$\Delta S = S_f - S_i = S_0(\tau_f) - S_0(\tau_i) + \frac{1}{2} k\beta^2 (\alpha_f^2 - \alpha_i^2). \qquad (42)$$

For the unfolded state given by eq. (22), we get

$$\Delta S = k\beta^2 (1 + \eta)(\alpha_f^2 - \alpha_i^2)/2 \qquad (43)$$

which implies an increase of entropy for $Y_f < Y_i$ (unfolding) and a decrease of entropy for
$Y_f > Y_i$ (folding).

Simulation studies suggest that the ratio of the entropy change, $\Delta S$, to the heat
capacity change, $\Delta C_p$, for the dissolution of a variety of hydrophobic compounds is a constant.
This is confirmed by our formulation too. The ratio can be determined from eq. (35) and
eq. (40):

$$\frac{\Delta S}{\Delta C_p} = \frac{\left(1 - \beta \frac{\partial}{\partial \beta}\right) \ln\frac{Z_f}{Z_i}}{\beta^2 \frac{\partial^2}{\partial \beta^2} \ln\frac{Z_f}{Z_i}} \qquad (44)$$

It is easy to see, from eq. (39) and eq. (43), that the ratio depends only on the properties of
the unfolded sequence: $\Delta S / \Delta C_p = (1 + \eta) / [2 (1 - \eta)]$.
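The constancy of this ratio can be checked numerically by dividing eq. (43) by eq. (39) for several transitions. The sketch below is illustrative only: the function names and all numerical values are hypothetical, units with $k = 1$ are used, and $\eta \neq 1$ is assumed so that $\Delta C_p$ does not vanish identically.

```python
import math


def alpha(Y, Y0):
    """alpha_k = exp(-(Y_k - Y0)), the definition given below eq. (38)."""
    return math.exp(-(Y - Y0))


def delta_cp(Y_i, Y_f, Y0, eta, beta, k=1.0):
    """Eq. (39): Delta C_p = k beta^2 (1 - eta) (alpha_f^2 - alpha_i^2)."""
    return k * beta**2 * (1.0 - eta) * (alpha(Y_f, Y0)**2 - alpha(Y_i, Y0)**2)


def delta_s(Y_i, Y_f, Y0, eta, beta, k=1.0):
    """Eq. (43): Delta S = k beta^2 (1 + eta) (alpha_f^2 - alpha_i^2) / 2."""
    return 0.5 * k * beta**2 * (1.0 + eta) * (alpha(Y_f, Y0)**2 - alpha(Y_i, Y0)**2)


# Hypothetical values for the unfolded-state parameters.
ETA, Y0 = 0.3, 0.5
expected = (1.0 + ETA) / (2.0 * (1.0 - ETA))

# Several hypothetical transitions (Y_i, Y_f) at different inverse temperatures.
cases = [(2.0, 1.0, 1.0), (1.0, 3.0, 0.5), (0.7, 0.6, 2.0)]
ratios = [
    delta_s(Y_i, Y_f, Y0, ETA, beta) / delta_cp(Y_i, Y_f, Y0, ETA, beta)
    for (Y_i, Y_f, beta) in cases
]

for (Y_i, Y_f, beta), r in zip(cases, ratios):
    print(f"Y_i={Y_i}, Y_f={Y_f}, beta={beta}: ratio={r:.6f}, expected={expected:.6f}")
```

The factors $k\beta^2$ and $(\alpha_f^2 - \alpha_i^2)$ cancel in the ratio, so the result is the same for every transition and temperature and depends on $\eta$ alone.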



VII. CONCLUSION 



To summarize, a protein sequence in general is described by a multi-parametric ensemble
of interactions. Our study shows, however, that the thermodynamic properties of the sequence
are governed by a single parameter (besides temperature), which is basically a measure of
the average uncertainty associated with the local interactions. The formulation provides an
analytical understanding of some important observations obtained in computer simulation
studies of proteins, e.g. the dependence of the native state on the original sequence, the role
of the solvent, and the decrease in stability of the native state above and below the critical
temperature. The stability of a folded sequence against mutations can also be explained by the
$Y$-dependence of the free energy: a mutation may change the interaction parameter $\omega$; however,
$Y$ may remain unaffected (the change being averaged out in the combination of interaction
parameters). Such mutations will leave the native state unaffected. The $Y$-formulation also
explains the selection of specific folding pathways among an infinite number of possibilities
and can be used to identify them. We have yet to apply it to many other observations from
simulation studies, for example the observed preference for functionality and folding speed,
instead of stability, as the main criteria for the selection of a natural protein conformation,
and studies on the misfolding of proteins.

The random matrix approach described here is applicable only to interaction matrices with
independent matrix elements; this takes into account only two-body interactions. In general,
a protein is a complex system with many-body interactions, and consequently the interaction
matrix contains correlated elements. The generalization of the single parametric formulation
to protein models with correlated matrix elements is very desirable; we intend to pursue some
of these questions in future studies.